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Searches for faint signals in counting experiments are often encountered in particle physics and 
astrophysics, as well as in other fields. Many problems can be reduced to the case of a model with 
independent and Poisson-distributed signal and background. Often several background contributions 
are present at the same time, possibly correlated. We provide the analytic solution of the statistical 
inference problem of estimating the signal in the presence of multiple backgrounds, in the framework 
of objective Bayes statistics. The model can be written in the form of a product of a single Poisson 
distribution with a multinomial distribution. The first is related to the total number of events, whereas 
the latter describes the fraction of events coming from each individual source. Correlations among 
different backgrounds can be included in the inference problem by a suitable choice of the priors. 
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1. Introduction 

Searches for faint signals in counting experiments are often encountered in particle physics 
(e.g. in searches for new resonances) and astrophysics (e.g. in searches for new sources of 
high-energy photons). Many practical problems can be reduced to the case of a counting 
experiment where the events come from the sum of two independent Poisson-distributed 
contributions, usually referred to as “signal” and “background”. Although an accurate 
estimate of the background is desired in most applications, the sole parameter of interest 
is typically the signal intensity. In other words, the background intensity can be viewed 
as a nuisance parameter of the statistical inference problem, whose goal it is to estimate 
the signal strength given the measured count rate. 

Sometimes the background is the result of different sources. In such cases, one has a 
model in which the signal is superimposed to the total background. If we knew which 
event comes from which source, we could keep separate lists for the number of observed 
events for each source of signal or background. However, in many practical problems, 
only fewer quantities are directly observable, and sometimes only the total number of 
observed events can be measured. An example is the number of events recorded by 
a Geiger-Miiller counter when measuring the intensity of a weak radioactive sample. 
The counts give only information about the total rate of radioactive decays, and not 
about the isotopic composition of the sample or the environment. Often one has prior 
knowledge about the individual contributions which can be used to improve the result of 
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the inference. In the example of the weak radioactive source, this could be any knowledge 
about the rate of natural radioactivity of the surrounding and the rate of cosmic muons, 
obtained from a measurement without the source. 

In this paper we address the inference problem outlined above by means of objective 
Bayesian methods. These are based on the likelihood function, on informative background 
priors, and on an objective signal prior. This work is based on Refs. [6] and [8], and 
extends their results. 

Choosing a Bayesian approach forces the scientist to be explicit about all the as¬ 
sumptions made. Some of them may be “subjective”, in the sense that they reflect the 
professional experience of the scientist. As this is validated by numerous cross checks 
and by the criticism of the colleagues, a better term might be “inter-subjective” but 
we will keep the standard terminology. On the other hand, “objective” means that the 
posterior depends only on the assumption of the model, although there is always some 
degree of (scientific) subjectivity in the choice of the model in the first place. This is 
exactly the meaning attributed to “objective” in frequentist approaches. Hence we are 
concerned here with Bayesian results that are by construction as objective (or subjective) 
as frequentist results. 

Bayesian solutions have the advantage of a very simple interpretation: one obtains a 
range of admissible values for the parameter of interest (the signal intensity) together 
with the probability that the true value belongs to that range. On the other hand, the 
true value of the parameter is not a random quantity in the frequentist approach and 
thus one can not speak about the probability that it is inside some interval. Hence, a fre¬ 
quentist solution is a statement about the probability of the data given the true value, in 
a picture in which the experiment is imagined to be identically repeated a large number 
of times. In practical situations, this is a very idealized picture, as identically repeating 
the experiment is either unfeasible or too expensive. Furthermore, when feasible it might 
also appear undesirable. For example, in high-energy physics data are collected during 
several “runs”, ultimately related to the availability of particle interactions. Colliders 
inject particles before the run starts and keep them circulating until the collision rate 
drops enough to justify a new fill. Detectors enable data taking after each fill, and keep 
running until collisions are detected or some problem forces restarting the run. In prac¬ 
tice, although to some extent all runs could be considered as identical repetitions of the 
same experiment, during data analysis they are joined to form a bigger sample, rather 
then treating them as repeated experiments. 

From a computational perspective, Bayesian methods require the calculation of mul¬ 
tidimensional integrals while frequentist methods typically adopt maximization proce¬ 
dures. The latter are typically less CPU-intensive, if those calculations are performed 
using numerical methods. However, we are concerned here with a problem that can be 
treated analytically, hence the computing time is negligible. 

Both approaches are perfectly legitimate. When addressing statistical inference, it is 
important to clearly formulate the question one wants to answer, because this implies the 
choice of the paradigm. For example, if one wants to speak in terms of the probability 
of a model (in our case, a model including contributions from a signal source) given 
the experimental result (e.g. the total count rate), then the Bayesian approach is the 
correct paradigm. On the other hand, if the question is cast in terms of how (un)likely 
the present outcome would be in an ideal infinite sequence of identical repetitions of the 
same experiment, then the frequentist approach is the choice. We perform the statistical 
inference from the Bayesian point of view because we believe that this answers the most 
typical question asked by scientists once an experimental result is made public: what is 
the probability that this or that model is (in)correct? 

Regardless of using a Bayesian or frequentist approach, the statistical model used in 
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the inference problem has to be defined carefully. Although the approach chosen in the 
following guarantees that in the asymptotic limit of a very large number of observed 
events the posterior will concentrate around the maximum of the likelihood, the most 
interesting class of problems is on the opposite side, i.e. when collecting a very low 
(possibly null) number of events. It is hence important that the general solution to this 
problem does not rely on asymptotic approximations and assumptions, e.g. by using a 
Gaussian model in the regime of small count rates. 

Considering the contributions of signal and different background sources to the count 
rate allows not only to infer on the signal-plus-background rate, but also to investigate 
the individual contributions. It is possible to model such a situation by either a product 
of independent Poisson distributions — one for each signal or background source — or by 
a product of a single Poisson distribution and a multinomial distribution. In the latter 
form we can encode correlations between the different contributions with the help of 
an informative Dirichlet prior. The inference problem may be then split into two parts: 
inference about the total intensity, and inference about the relative contributions of the 
different sources. The main result is that an analytic objective Bayesian solution can be 
obtained, which is valid for any number of measured events (even when it is very low or 
zero). 

An example for such a situation is the search for contributions of a new physics process 
at a hadron collider in a particular part of a spectrum, say, an invariant mass peak on 
top of a set of Standard Model background processes. The expected number of events 
due to background sources which are estimated based on theoretical cross-sections are 
partially correlated due to common uncertainties, for example the integrated luminosity 
of the data set or the scale of the invariant mass. These correlations can be modeled by 
suitably choosing a prior for the multinomial part. 

This paper is organized as follows. Section ^ gives details about the statistical models, 
while Section ^ describes the Bayesian inference. Section ^ concludes the paper. 


2. The statistical model 

A statistical model is the specification of the probability to observe any outcome given 
the values of the parameters. When considered as a function of the unknown parameters 
given the observed data, the model is called the likelihood function. 

We first address the case of a counting experiment with k contributions and for which 
the counts Xi,X2, ■ ■ ■ ,Xk € N are independent Poisson variables and are observed in¬ 
dependently. We recall the well known result that the likelihood function can either be 
written as a product of Poisson distributions, or as the product of a Poisson distribution 
for the total number of counts, N = Xi + X2 X^, and a multinomial distribution 

describing how the counts are partitioned across the different contributions. 

This is most often used when considering the entries in different histogram bins. The 
number of events in each bin, taken alone, follows a Poisson distribution. The shape of 
the histogram, normalized to unit area, is described by a multinomial distribution. For 
this reason, it is sometimes useful to imagine that our counts Xi,..., Xk are represented 
by a histogram with k bins. 

Later, we consider the situation when one additional contribution has a special mean¬ 
ing, that is it represents the signal. 
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2.1 k independent Poisson sources and k observables 

If Xj G N with i = 1 ,... ,k are independent Poisson random variables, the statistical 
model of the data is a product of Poisson distributions 

k 

X~[]Poi(X,|A0 (1) 

i=l 


with parameters A* G M"*", where 


Poi(X|A) = ^e-\ (2) 

The outcome of the experiment is the fc-tuple x = {xi,X2, ■.. ,Xk) of observed counts 
for each contribution, using a notation in which lowercase letters stand for actual values 
of the random variables denoted with uppercase letters. The total number of observed 
events is n = Xi. 

One the other hand, the conditional distribution of the random vector X = 
(Xi,X2 ,..., Xk), given the total number of counts n = A) ^ multinomial dis¬ 

tribution 


Mul(X |n, 77) 


ni 


X,\X 2 l ■■■Xk\ 


Vl" V2^ 


Xk 

Vk 


( 3 ) 


with the conditions 


k 




= n. 


( 4 ) 


and 


k 

= ( 5 ) 

i=i 

The parameters fj = {r]i,r]2,... ,r]k) represent the relative proportions of the Poisson 
rates: 


^ with X = '^Xj . ( 6 ) 

i=i 

The condition ([^ implies that there are only k — 1 independent shape variables in 
Eqn. and could be used to write the multinomial distribution in a less symmet¬ 
ric form, for example by replacing with 1 — Vj X^ with n — A ™ 

Eqn. ([3]). 

The binomial distribution is a special case of the multinomial distribution with k = 2 , 
Bin(y|n,.)^ ^ A(l-er-^ 

n! ^ 

= x^X2\ ^ Mul(x |n, fj) 
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with 


X = (Xi,X 2 ) = {Y,n-Y), ff= {m,rj 2 ) = 


( 8 ) 


A well-known result is that the unconditional distribution of X can be factored into 
the product 


k 

X ~ JJPoi(XilAi) = Poi(n|A) Mul(X|n,r/) (9) 

i=l 

of a single Poisson term and a multinomial distribution. The Poisson distribution gives 
the probability of observing a total number of events n, given the sum of all expected 
contributions, A. The multinomial term describes how the individual counts Xi distribute 
among the k bins. 

Let us now assume that we have measured a total of n counts, and that we want to 
infer about the model parameter using Eqn. as the likelihood function. An immediate 
consequence of Eqn. Q is that n carries no information about fj (and vice versa), but 
only about A = Yl^j=i ^j- surprise that the statistical inference about the total 

rate only requires the total number of counts. 

On the other hand, ff encodes the shape of the distribution of counts across the k 
contributions, but not the normalization: likelihood-based inferences about fj are the 
same whether one regards {xi,X 2 ,... ,Xk) as sampled from k independent Poisson sources 
or from a single multinomial process. In other words, estimates and tests of any function 
of fj will be the same whether we regard the number of counts as a random variable or 
as a fixed value. 

This means that we can separately address the statistical inference problems related 
to the total count rate and that related to the contributions from different background 
sources. The first step is to solve a problem with a single Poisson model and n total 
observed events. The second step is to solve a multinomial problem which focuses on the 
distribution of counts, given the total number n of observed counts. 


2.2 Signal in the presence of k background sources 

If a Poisson distributed signal source exists on top of k background sources, and is 
independent of the background, one of the k + 1 contributions to the observed number 
of counts assumes a special meaning (the signal) and the model can be written as 

k 

Poi(y|s) Poi(Xj|6j) = Poi(n|s -|- b) Mul(Z |n, C) = P{n, Z\s, b) (10) 

i=l 

where 


b = ^bi, X = ^Xi, n = Y + X, 

s bi 


i=l 

Z=iY,X), 


2 = 1 




_ bk 

s + Ys + b'"''s + b 


Making use of the interpretation of a histogram described earlier, the different counts 
Z can be viewed as a histogram with counts from one signal region and k auxiliary 
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measurements. 


2.3 The multi-channel signal -/- background problem 

We are now interested in the case that the data show an evidence that, in addition to 
the known “background” source with non-negative bin yields b = {bi,b 2 , ■ ■ ■ ,bk) and 
overall strength b = representing nuisance (i.e. not interesting) parameters, 

there is a non-null contribution from a “signal” source whose non-negative bin yields 
s = {si, S 2 , ■ ■ ■, Sk) and overall strength s = parameters of interest. This 

is not just the “discovery” problem, for which one would typically focus on the total 
number of observed counts, because here we are also interested in the shape of the signal 
contribution. This is likely to be the second step after a discovery phase of a new particle, 
for example when one wants to conduct a measurement of the mass or width of the new 
particle, or to distinguish between two competing models which predict different shapes 
for the distribution of some kinematical quantity. 

Even though we have 2k independent sources, we do not know if any given event has 
come from a signal or background source. Hence we only have k observables, the counts in 
k bins. Formally, we start by considering the signal random variables li, I 2 ,..., bfc G N, 
which are Poisson distributed Yi ~ Poi(sj), and the background variables Zi, Z 2 , ■ ■ ■, G 
N, which are also Poisson distributed Zi ~ Poi(6i). Next, we define the random variables 
Xi = Yi + Zi which correspond to the observable counts in the k bins. As the sum of 
Poisson variables is again a Poisson variable, we have Xi ~ Poi(sj + bi), i.e. we recover 
the notation used in section |2.1| above with Xi = Si + bi, = 1, ..., k and X = s + b. 

In terms of the variables Y and Z, if n total events have been observed the likelihood 
function for 2k bins is 


Mul(y,Z| n, si/s, .. .,Sk/s, bi/b ,.. .,bk/b) (11) 

Now we partition the counts into Y = Yli=i signal events and n — Y = 
background events (see Appendix). Conditional on Y (and on re), we can now write the 
previous multinomial as the product 

Mul(y I F, si/s,. .., Sk/s) Mul(Z I re - y, 61 / 6 ,..., hk/b) (12) 

Because we actually do not know Y, we need to sum over all possible ways of obtaining 
y signal events out of the total re counts, with the help of a binomial distribution 

Bin(yI n, e) ^ y, A(1 - s)-"^ (13) 

whose parameter e = s/(s -|- 6 ) is the probability of obtaining a signal event when both 
signal and background are active sources. 

The likelihood for 2k bins is then 


L = Poi(re I s + &) ^Bin(yI n,s/{s + h)) x 
y=o 

X Mul(y I y, si/s,..., Sfc/s) X 
X Mu\{Z \ n -Y,bi/h,... ,bk/b) 


(14) 


6 



January 20, 2017 Journal of Applied Statistics poimul 


As a cross check, we can count the degrees of freedom: 1 Poisson variable, plus 1 bi¬ 
nomial variable, plus k — 1 multinomial variables for the signal, plus k — 1 multinomial 
variables for the background. In total, we have 2k variables: s together with k — 1 signal 
yields (e.g. si,..., Sk-i) are the parameters of interest, whereas the k background yields 
bi,... ,bk are nuisance parameters. 

We would have obtained the same result, equation (14), by considering the likeli¬ 
hood (j^ as the merged version of separate signal and background sources (see Appendix), 
and dehning Xi = Si + bi, such that 


Vi = 


Si + bi 


Sj ^ bj 
s + b s + b 


(one must not forget the binomial weights due to the splitting of the n events into two 
unknown partitions). 

Finally, because we can only observe the sum of signal and background counts in each 
bin, we have to merge the latter into a set of k bins. Given the observed Xi counts in bin 
i, there are several ways in which signal and background events can give this sum. Again, 
we account for this with the help of binomial weights, Bin(y)| W, e,) with e, = Si/{si+bi). 
The full likelihood becomes 


L=Foi{n\s + b) Mul ( X 
X I ^ Bin (y 


Si -|- 6i Sk + bk 

n, --Tbr ' ^ 

S -I- 0 S -I- 0 


KY=0 

X 

X E 

T,=0|V L 


n 


Mul Y 


' s + b 

rS... 5 ) nBm(r. 


(15) 


2 = 1 


Xi, 


Si 


Si -|- bi 


that is the product of ([^ — which is the first line in ( |15[ ) — with terms accounting for 
all possible combinations of /c-tuples of signal counts and /c-tuples of background counts 
which give as a result the observed counts (Xi,..., X^) in each bin. 

In general, given the experimental result (Xi,...,Xfc), the likelihood function (15) 
is the starting point for performing statistical inference about the parameters of inter¬ 
est, which are the signal yields si,... ,Sfc. However, when k is not small this model is 
computationally very intensive, as the number of possible combinations becomes huge. 


3. Bayesian statistical inference 

In the Bayesian approach, the statistical inference relies on the use of Bayes’ theorem, 
which can be interpreted as the algorithm for updating our knowledge about a model 
and its parameters (modeled as a probability distribution) in view of the outcome of an 
experiment. The prior knowledge is encoded into the prior distribution for the parame¬ 
ters of the model, and their joint posterior distribution (obtained with Bayes’ theorem) 
reflects our new state of knowledge about them. In many cases, we are only interested 
in a subset of the parameters. One can obtain their joint marginal posterior probability 
density by integrating the full posterior over all other (nuisance) parameters. 
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3.1 Inference about k independent Poisson sources and k observables 

We first address the problem of k independent Poisson sources and k observable count 
rates. The likelihood function is given by Eqn. ([^ with lowercase letters that remind us 
that the likelihood is a function of the parameters with fixed (and known) experimental 
outcome. 

Bayes’ theorem gives the joint posterior probability density for A and if 
p{X,f]\ n, x) oc p(A I n) p{ff\n, x) 

( 16 ) 

oc [Poi(n|A) p(A)] [Mul(x |n, r/) pfrf)] 

(where the normalization constant can be found by integrating over the r.h.s.) in the 
form of a product of two posterior densities. The first corresponds to a Poisson model 
with expectation value A and prior density p(A), from which n events are generated. The 
second corresponds to a multinomial model with parameters if and prior density p{if), 
generating the observed vector x of counts. They are two inference problems which can 
be solved independently, as is shown below. 

3.1.1 Prior knowledge about the Poisson problem 

Conjugate prior. The most convenient functional form for the prior density of the 
Poisson parameter is a Gamma function [6], i.e., the conjugate prior for a Poisson model, 
Poi(n|A), 


Ga(A|a,/3) = :^A“-'e-^^ (17) 

r(a) 

with shape parameter a > 0 and rate parameter /3 > 0 (or scale parameter 0 = l/f3 > 0). 
When the prior is p{X) = Ga(A | a, 13), the posterior is 

p{X I n) = Ga(A \ a + n, f3 + 1) . (18) 

Informative prior. In the simple but frequent case in which the prior knowledge about 
A is summarized by its expectation i7[A] and variance I7[A], the Gamma parameters are 
determined with the method of moments by imposing i7[A] = a//3 and E[A] = af f3‘^. This 
gives f3 = E[A]/P[A] and a = j3 E[X\. Alternatively, one could start from the prior most 
probable value (the Gamma mode is at {a — l)//3 for a > 1) and variance, or from the 
knowledge of intervals covering given prior probabilities (e.g. 68.3% or 95% probability; 
this requires a numerical treatment to find a and /3), or from any set of conditions which 
is sufficient to determine the shape and rate parameters. 

In principle, if there is quite detailed information about the prior for A and it is known 
that the Gamma density does not correctly represent its shape, one has two options. The 
first one is to adopt a different functional form and solve Bayes’ theorem numerically, 
and the second is to find a linear combination of Gamma densities which approximates 
the prior well enough. In the latter case, the posterior will be a linear combination of 
Gamma functions. This simplifies the treatment without biasing the result, provided that 
enough terms are present in the linear combination of priors, and it is hence preferable 
to numerical integration. When n is large enough, the solution becomes lesser and lesser 
dependent on the exact shape of the priorQ hence one should consider using numerical 


^Bayesian statistics obey the maximum likelihood “principle” (which is indeed a theorem in this framework): when 
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methods or a linear combinations only if n is very small and — most importantly — 
when one is very confident that a single Gamma density is not good enough to model 
the prior. In practical problems, this happens very rarely (although it is not impossible), 
and a vague knowledge of the prior shape is typically sufficient. 

In cases where only vague prior information is available, it is recommended to assess 
the dependence of the posterior on the choice of the prior by comparing the results 
obtained with different priors, all compatible with the available information (e.g. different 
functional forms giving the same first two moments). When the posterior does not change 
appreciably, it means that n is large enough for the likelihood to dominate and that the 
result is robust. Otherwise, one has to conclude that the experiment did not yield enough 
events for obtaining a solution that does not depend on our prior state of knowledge. 
In this case, it is best to report the differences of a few key figures of merit (like the 
posterior mean and variance, or posterior credible intervals with given probability) which 
correspond to the choice of quite different priors. 

How different should the priors then be? Of course, comparing our informative Gamma 
density with a delta-function does not make any sense: there are requirements that admis¬ 
sible priors must satisfy to be considered acceptable. Luckily enough, such requirements 
are really minimal and typically reasonable, like requiring that the prior is strictly posi¬ 
tive over the entire domain and that the posterior be a proper density (i.e. it integrates 
to one). In addition, a formal procedure exists which gives us the “most distant” prior 
to our informative prior. We can call it the “least informative” prior or the “objective” 
prior, and it is provided by the reference analysis [3] based on information theory. This 
reference prior is the function which maximizes the amount of missing information about 
the system, hence it is the “most distant” choice from our informative prior. In other 
words, the reference prior encodes the minimal set of information about the problem. 
Indeed, it only depends on the description of the model itself (in the case under consider¬ 
ation here, on the Poisson distribution) and makes no use of any prior knowledge about 
the parameter other than the specification of its domain (which for a Poisson problem is 
the entire positive real line). For this reason, the reference prior shall be used when one 
wishes to adopt an “objective prior”. 

Objective prior. When assessing the dependence of the result from the prior informa¬ 
tion, it is recommended to compare against the reference posterior [3], i.e. the solution 
obtained with the reference prior 7r(A), which for the Poisson model coincides with Jef¬ 
freys’ prior: 7r(A) = The reference posterior is then 

p{X I n) = Ga(A | re -|- g, 1) (19) 

and also represents the best choice when one is required to report an “objective” result, 
or when one claims to have the minimal prior information about the Poisson parameter. 
Gompared to the informative prior, which is chosen to best encode all available prior 
information, the reference prior is the “most distant” choice from the information-theory 
point of view, as it encodes the minimal prior information. 

A simple example. In an attempt to measure the amount of background activity in 
a new lab, one performs two subsequent measurements for the same duration. For the 
first measurement, no prior knowledge about the expected rate A is assumed, hence the 
non-informative prior is chosen. The measurement yields an observed number of events 


n is arbitrarily large the shape of the prior does not matter at all, provided that it is strictly non-null in the region 
where the true value is. 
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of ni = 9, so that the posterior distribution is \ 9) = Ga(A | 9.5,1) with expectation 
value i?[A] = 9.5 and variance y[A] = 9.5. 

The posterior of the first measurement is then used as an informative prior for the 
second measurement. The number of observed events is n 2 = 12 and so the posterior 
distribution is p{\ \ 12) = Ga(A | 21.5, 2) with expectation value £'[A] = 10.75 and variance 
y[A] = 5.38. 

One obtains the same result by merging the two data sets and applying the objective 
prior to this new “experiment”. In this case, the posterior is p{\' \ 9+12) = Ga(A' | 21.5,1), 
where the new random variable \' = 2A because merging the two observations is equiv¬ 
alent to observing for a duration twice as long. It is very simple to show that, after 
changing variable, one obtains Ga(A' | a, h) dA' = Ga(A | a, 2b) dA, which means that the 
posterior for the original parameter is Ga(A | 21.5, 2), the same as above. 

3.1.2 Prior knowledge about the multinomial problem 

Conjugate prior. The conjugate prior of the multinomial model, Mul(x|n, ??), is the 
Dirichlet distribution, which means that the posterior is also a Dirichlet distribution and 
its parameters are simple functions of the prior and multinomial parameters. 

The Dirichlet distribution with concentration parameters a = (ai,..., a^) (with k >2 
and Oj > 0 for i = 1 ,..., /c) is 


Dir(^| a) = flpp ^ , (20) 

where the multidimensional Beta function has been used to write the normalization 
constant 


B{a)^ 


nil r(aj 

TiA) 


with 


k 

A = aj . 

i=l 


( 21 ) 


If the prior for the multinomial problem is p{ff) = Dir(r/| a), the posterior is a Dirichlet 
distribution with parameters x + a: 


p{f]\ n, x) = Dir(r/1 x + a) 


B{x + a) 


( 22 ) 


When k = 2, the multinomial becomes a binomial distribution. The corresponding 
conjugate prior is the Beta density, 


Be(e I ai, 02 ) = (23) 

B(ai,a2) 

which is indeed the special case of a Dirichlet distribution with k = 2, pi = e, r /2 = l —e. 

One obtains a Beta density also as the marginal distribution of any Dirichlet parameter, 
i.e. the Beta density is the result of integrating a Dirichlet over k — 2 parameters. 


Informative prior. The following properties of the Dirichlet distribution can be used 
to hnd the values of the concentration parameters which best represent all the available 
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prior information: 


E[r]i] = Oi/A 

(24) 

mode[r?i] - ^ ^ 

(25) 

yr 1 _ ai{A-ai) 

A2(A + 1) 

(26) 


(27) 


Often the parameters may be grouped into two classes: a set of parameters that are 
independent from any other parameter, and a set of parameters that have pairwise cor¬ 
relations. By grouping them correspondingly, one obtains a block-diagonal matrix in 
which the first block is actually diagonal. Parameters in the first block are then fixed by 
imposing their prior variance and expectation. For the others, one has a system of re¬ 
quirements, for example in the form of a collection of variances and covariances. As there 
may be more constraints than free variables, the prior concentration parameters may be 
found e.g. by solving a minimization problem. Alternatively, one may choose to solve the 
problem multiple times, each with a different choice of concentration parameters which 
satisfies k independent constraints. 

On the other hand, if there are less constraints than the number of parameter^ one 
may impose the “objective” choice on the unconstrained variables, i.e. equal prior con¬ 
centration parameters. This might be done in a two-step algorithm starting with an 
objective prior with concentration parameters all set to 0.8/k (see below), and by imag¬ 
ining an intermediate experiment in which only a subset of m < A: variables get updated 
to Dir(yi -|- 0.8/A :,... ,ym + 0.8/A:,0.8/A:,...,0.8/A:). The next step is to find real values 
yi,, yra such that the first m concentration parameters satisfy all available constraints. 
The result can be used as the prior for the actual experiment. 

Non-informative prior. The problem of finding an objective prior for all multinomial 
parameters was addressed by Berger & Bernardo [T] and Berger, Bernardo & Sun [3]. 
Being aware that for multidimensional models the reference prior depends on the ordering 
of the parameters, Berger, Bernardo & Sun [3] argued that, when no priority ranking is 
desired or allowed (in other words, when one wants all rji parameters to be on the same 
footing), some sort of “overall objective prior” shall be used which treats all parameters 
fairly well. A very reasonable requirement is that the marginal posterior for any given 
parameter should be as close as possible to the solution obtained with the marginal model 
and the reference prior for that single parameter. 

For the multinomial model, the reference posterior when only r]i is of interest is 
Be{rji \xi -|- 1/2,n — Xi -|- 1/2)On the other hand, if the prior Dir(77| a,a,... ,a) is 
used — using the same concentration parameter a for all multinomial r]i parameters, 
as they are treated on the same footing —, the marginal posterior for rji is instead 
Be(ryj \ Xi a,n — Xi {k — l)a). The goal is to find a such that, in an average sense, 
this result is closest to Be(r/j | Xi + 1/2, n — Xi + 1/2) for each i. Among different pos¬ 
sible approaches to this problem [2], Berger, Bernardo & Sun [3] addressed it with a 
hierarchical approach and showed that the overall objective prior is a Dirichlet density 


^We expect this case to be just an academic exercise, as in all important problems that we are aware of all 
background components are known to some extent. 

^As the multinomial with k = 2 is equivalent to a binomial distribution, this result can also be proved with the 
latter model. The reference posterior is the same Beta density, as it is shown for example in Ref. [ 3 . 
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with all identical concentration parameters, whose value is very well approximated by 
the simple choice a* = 0.8/k (just a bit lower than the intuitive choice 1/k), which gives 
an objective Dirichlet posterior with parameters Xi + 0.8/k. 

A simple example, continued. In the case of two subsequent measurements, one might 
be interested in the background stability with time. This is relevant during the prepara¬ 
tion of low-background experiments, e.g. if newly produced parts of an experiment are 
brought into underground laboratories. In our example, we could imagine to have in¬ 
stalled new equipment between the two measurements and ask ourselves if the somewhat 
larger number of counts in the second measurement suggests that the equipment has 
increased the overall background rate. One way of checking this is to look at the 2-bins 
histogram with xi = 9 and X 2 = 12 counts and quantify the departure from equipar- 
tition. A more formal procedure would be to test against the null hypothesis that the 
background rate is constant, but it is not necessary in this particular example, which is 
quite straightforward. 

To better illustrate the problem, we use first a binomial model with y = xi = 9 
successes out ofn = xi-|-X 2 = 21 total trials and look at the deviation from s = 0.5, 
where e is the ratio between the initial and the total rate. The reference posterior for e 
is [7] 


p(e|?/,n) = Be(e|9.5,12.5) (28) 

The posterior mean, mode and standard deviation are 

E[e] = 0.4318 M[e] = 0.4250 cj[e] = 0.1033 (29) 

If we start with Mul(xi,a: 2 |£, 1 — e) and use the overall objective prior Dir(e, 1 — 
e|0.4,0.4) recommended by Berger, Bernardo & Sun [3], we obtain instead the marginal 
posterior 


p(e|7/, n) = Be(e|9.4,12.4) (30) 

whose mean, mode and standard deviation are 

E[e] = 0.4312 M[e] = 0.4242 cj[e] = 0.1037 (31) 

Thus the two alternative solutions are hardly distinguishable in practice. 

The peaks are 0.726(T and 0.730 cj lower than e = 0.5, not a significant difference to 
conclude that the background level has changed between the two measurements, even 
without performing a formal test. Hence it is safe to use both measurements to estimate 
the background level in the cavern. 

An example from high-energy physies: lepton universality. Assume an experiment 
built to test lepton universality (see, e.g. Refs. [9] for the current experimental status) in 
Z-boson production which is equally sensitive to electrons, muons and taus, i.e. assuming 
equal reconstruction efficiencies, momentum resolutions and so on. Events containing Z- 
bosons are selected by measuring final states with two opposite-charge leptons and by 
requiring their invariant mass to be consistent with the mass of a Z-boson. Adding 
further stringent event-selection requirements, the resulting event sample can be made 
practically background-free. The number of observed events in this example are Ug = 17, 
= 19 and Ur = 12. Lepton universality is tested by estimating the branching ratios, 
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Figure 1. Contours of the smallest area containing 68.3%, 95.5% and 99.7% probability of the two-dimensional 
marginal posterior distribution (left), as well as the two one-dimensional posterior probability densities for rje 
(middle) and and rj^ (right). 


e.g., r]e = Uejn, where n = ne + n^ + Ur = 48, and comparing them. If they are found to 
not be equal, then lepton universality is disfavored. 


Using Eqn. (16) with a reference prior for the overall rate A and an objective prior for 


the multinomial part yields 


p(A,r/| n,f) oc [Poi(n|A) A [Mul(f |n, r/g, r/^, Dir(77e, | 0.8/3, 0.8/3, 0.8/3)]. 


The resulting two-dimensional marginal posterior, depending on r/g and rj^, is shown 
in Figj^ together with the corresponding one-dimensional projections. The expectation 
from lepton universality, i.e., r/g = = 1/3, is indicated by the asterisk and consistent 

with the observation. 


3.2 Inference about k 1 independent Poisson sources and a single 
observable 

Suppose that we have an experiment in which only a single count rate is measured which 
is known to originate from k different (and possibly correlated) background sources that 
are known to exist, and a possible additional signal source. Taken individually, each 
background source is modeled as a Poisson process with parameter bi, and the signal is 
an independent Poisson process with parameter s. 


We can write the model as we did in Section 


2.2 


above, i.e. P{n, Z\s,b), but now we 
have to consider that we do not know where the individual counts come from, i.e. we 
have to sum over all possible conhgurations of Z. The model thus becomes 


P{n,Z\s,b) —)• P{n\s,b) = E P{n,Z\s,b) 

z 

= Poi(n|s + 6)^Mul(Z|n,C) (32) 

z 

= Poi(n|s -|- b ), 


where the latter equation follows from the normalization of the multinomial distribution. 

Measuring only n simplifies the problem significantly, but also reduces the available 
information. The difference with respect to the case considered in the previous section is 
that the multinomial part has disappeared. In addition, the Poisson term now describes 
the sum of two independent sources. 

One performs inference about the signal intensity s by integrating the marginal pos- 
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terior p{s, b\n, a, /3) over b, where a and /3 describe the signal prior. The objective result 
is the reference marginal posterior for s computed in [6] 

p{s\n) (X e~^ f{s]n, a, f3) 7r{s) (33) 


where the polynomial 


f{s;n,a,l3) = 


m=0 


a + m - 1\ s’"-”* 

m ) {n — my. (1 + /3)’ 


(34) 


behaves like (s + ajP)^ when both a,/3 are very large. Using Wolfram’s Mathematica, 
another representation can be found for this polynomial, in terms of the confluent hy¬ 
pergeometric function of the second kind U{a,b,x): 


/(s; n, a, /3) = s” [(1 + /3)s]" [/(a, a + n + 1, (1 + /3)s)/n! 


(35) 


The reference prior 7r(s) computed in [6] is an improper density, proportional to the 
square root of Fisher’s information computed with the marginal model: 


|/(s)|l/2 = 




1 + /S 


E 

n=0 


[/(s;n,Q;,^)]^ 
/(s;re + l,a,/3) 


- 1 


1/2 


(36) 


Among the methods for constructing probability matching priors considered by Ventura, 
Cabras & Racugno [lOj, it comes out that only the marginal model P(n|s) = f Poi(n|s-|- 
b) Ga(6|a, /3) db provides a solution: no other pseudo-likelihood can be used, as it can be 
checked with direct calculationjf] 


The full objective Bayesian posterior Eqn. (33) for s can be computed with the help of 


the Bayesian Analysis Toolkit [5]. However, it is often possible to adopt a quick approx¬ 
imation. Despite from its complicate form (Eqn. (36)), it comes out that the reference 


prior admits a very simple limiting form that can be used in many practical problems [8j . 
It corresponds to the case of a perfectly known background, and performs very well in a 
wide portion of the (a, (3) parameter space. As a consequence, the reference marginal pos¬ 


terior Eqn. (33) can be often well approximated by a single (truncated) Gamma density 

i: 


Po{s\n) = —Ga{s + ^\n + ^,l) 


with normalization constant^ 


C = 


= 1 - 


Ga(s -|- ^|n 1) ds = / Ga(a:|n -|- 1) d® 

Ja/ti 

l{n+ |, 1 ) 

r(n-k 5) 


(37) 


(38) 


where the last fraction defines the regularized Gamma function, that is the cumulative 
distribution function corresponding to Ga(x|n-|- ^,1), in terms of the incomplete gamma 


■^Stefano Cabras, private communication. 

®There is a typo in the corresponding equation of [5] which has been fixed in the corresponding erratum. 
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Figure 2. Marginal posterior of the signal intensity of the new radioactive source. 


function j{n + ^, 1) and the ordinary Gamma function. 

Incidentally, we remark the similarity between the limiting reference posterior po('S|?^) 
from Eqn. (37) and the reference posterior (19) obtained when there is no background. 
Apart from the normalization, the former is the same as the latter with an offset — ^ in 
the origin. This shift might be interpreted as a form of “backgronnd subtraction”, as ^ 
is the prior expectation for b. Thus po{s\n) includes the particular case of no background 
and generalizes the result (19) obtained with Jeffreys’ prior. 

Note also that, although the multinomial part has disappeared, studies of the individual 
background contributions are still useful: all background sources that have a significant 
impact on the total background contribution b — also taking into account the correlations 
on their uncertainty — are relevant for the inference and the conclnsions drawn from 
the model described in Eqn. (32). One has also to be careful to consider any systematic 
effect that may have an impact on the overall background uncertainty. For example, the 
latter could be sensitive to a pair of strongly anti-correlated sources of background, even 
though their variations in opposite directions have no net effect on b. Hence they cannot 
be neglected in the analysis of systematics. In contrast those backgronnd sources, which 
do not contribnte significantly to b nor impact on its uncertainty, can be ignored in the 
inference about s. 


A simple example, continued. After the background radioactivity in the new lab has 
been estimated by making nse of the results of two experiments, a potentially radioactive 
source is brought in, to estimate its intensity. The signal that we want to estimate is 
the intensity of the new source, and the measurement is performed in a laboratory in 
which we know already that some background radioactivity is detectable. For simplicity, 
the measurement is conducted in the same time interval as the previous measurements. 
It yields ns = 17. Using the reference prior for the signal and the posterior of the 
second measurement of the background as informative prior, from Eqn. (37) one gets 
the posterior distribution po(s|17) oc Ga(s + ^^|17.5,1) which is shown in Fig. 2l The 
posterior peaks at s = 5.75, while the median (mean) for the signal is at 6.6 (7.0), and 
the central 68.3% credible interval around it (dehned by the 15.85% and 84.15% posterior 
quantiles) is [3.0,11.0]. The 95% upper bound on the signal is s < 14.2. 
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Figure 3. Comparison of the prior (red) and posterior probabilities (black) for the signal parameter s (left) and 
the branching ratio S 2 /s (right). Also shown in the posterior probability density for the branching ratio si/s 
(middle) including the smallest intervals containing 68.3%, 95.5% and 99.7%. 


3.3 Inference about the strength of a signal distributed over several 
channels 

In case a signal is distributed over several channels and each of these channels features 


a different background level, one can use the likehood defined in Eqn. 15 to infer on the 
signal strength. As an example, we consider the decay of a particle into three different 
channels. The total number of observed events is n = 62 and they are distributed among 
the three channels as ni = 21, n 2 = 29 and ns = 12. The background contributions are 
assumed to be known from auxilliary measurements and their priors are modeled with 
Gamma densities with mean values and standard deviations of 6 i = 10±1, 62 = 6±1 and 
63 = 2 ± 0.5. We use a Dirichlet prior for the branching ratios Si/s with concentration 
parameters 0.75, 1.5 and 0.75, respectively, resulting in expectation values of 0.25, 0.5 
and 0.25 with relative uncertainties of up to about 30%. We choose a Jeffreys prior for 
the signal contribution. 

Fig. shows a comparison of the prior and posterior probabilities for the signal param¬ 
eter s (left) and the branching ratio 82/3 (right). Also shown is the posterior probability 
density for the branching ratio si/s (middle). The posterior mean and standard devi¬ 
ation of the marginalized posterior probability of s are 42.4 and 8.4, respectively. The 
mean value is close to the mode of the distribution, and, compared to the naive expec¬ 
tation of s = n — E[ 6 ] = 44, it is shifted towards smaller values due to the steeply falling 
prior probability. Note that the shift is small compared to the standard deviation. The 
mean values of the branching ratio posteriors for si/s and S 2 /S are 0.54 and 0.23, respec¬ 
tively. The standard deviation is 0.08 in both cases, in accordance with the expectation. 
As expected, the inference process has a negligible impact on the posterior probability 
densities of the background contributions. Fig. shows the smallest intervals containing 
68.3%, 95.5% and 99.7% posterior probability for the two-dimensional distributions of 
82/3 vs. 8 (left), 83/3 vs. 3 (middle), and 83/3 vs. 82/3 (right). Also indicated are the 
mean and standard deviations. All three distributions show weak anti-correlations. The 
latter distribution also shows a sharp cut-off at the physical boundary (the sum of 82/3 
and S 3 can at most be unity). 


4. Summary 

In this paper, we have derived a statistical model for counting experiments in which the 
sources contributing to the resulting count rate are potentially correlated. Using Bayes’ 
theorem and assuming non-informative and informative priors on signal and background 
contributions respectively, we have shown that the marginal posterior probability for 
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Figure 4. Smallest intervals containing 68.3%, 95.5% and 99.7% posterior probability for the two-dimensional 
distributions of S 2 /s vs. s (left), sz/s vs. s (middle), and ss/s vs. S 2 /s (right). Also indicated are the mean and 
standard deviations. 


the signal contribution can be expressed in closed form. Simple use cases in the field of 
experimental physics have been presented. 
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Appendix A. Useful properties of the multinomial distribution 

We recall here a few properties of the multinomial distribution that are important when 
considering practical problems, in particular its behavior under re-binning and partition¬ 
ing. 

The first one is about re-binning: merging random variables distributed according to 
multinomial distributions again gives a multinomial distribution. If A ~ Mul(n, ff) with 
V = ■ ■ ■,'nk), then X* = (Ai-hA 2 , A 3 ,..., Afc) ~ Mul(n,?f) where rji = Vi+m = 

(Al + A 2 )/A and r]j = for j = 2, 3,..., A: — 1. One may obtain a binomial problem 
by sequentially merging k — 2 bins. 

The second property is important if we know the total number of counts n' in a subset 
of the bins: the conditional distribution given that Xi-\-X 2 = n' and A 3 -)-.. .-j-A^ = n—n' 
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is a product of two independent multinomial pieces: 


(Xi,X2)~Mul(n', 


Vi 


m 


(^3, 




Mul(n — n', 


Vi + m V1 + V2' 




J=3 ^ J=3 


(Al) 


For example, this property tells us how to treat two histograms at the same time. 

Conversely, the third property is that the joint distribution of two independent multi¬ 
nomial random vectors has a multinomial kernal. To see this, one starts from 


X ~ Mul(X|n,r/) 


Y ~ Mul(y |n',0 


n\ 


X,\X2l ■■■Xkl 




Xk 

Vk 


k k 

i=i i=i 


n'\ 


Y 1 IY 2 I ■■■Ym'. 




'^m 1 


i=i 


= 1 


i=i 


= n 


and writes their joint distribution as 

A, y ~ Mul(A |n, ff) Mul(y |n', C) 

n! n'\ X X- 

^ AilAs! •••Afc! TiITs! ■■■YJ. ' 

Defining 

Z = {X,Y) (A3) 


Vk" 


Cl 


Yi 


^52 


Xru 


(A2) 


such that Zj = Aj for z = 1,..., A: and Zj = y)_fc for i = /c -|- 1,..., fc -|- m, one has 


fc+m 




= n + n' 


Furthermore, defining 

(A4) 

one also has 


fc+m 

E«j = i 


i=i 


such that Eqn. (A2) may be rewritten as 

2’^+’^'n!n'! 

z 


Zi! Z2! • • • Zfc+m! 


^2 




(A5) 


where we used Hjii” 2^^ = 2 ^ 3=1 = 2”'+”''. 
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This looks like a multinomial with k + m — l degrees of freedom. However, we started 
from two multinomials with k — \ and m — 1 degrees of freedom, hence must obtain a 
distribution with k + m — 2 free parameters. Eqn. (A5) gives the probability distribution 
conditional to n and n'. If, however, only the total number of counts ntot = n + n' is 
known, then one must sum Eqn. (A5) over all possible pairs of n and n' which result in 
a total count of ntot- This reduce by one the number of independent variables. 
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